README for Simulation Programs (Program_1 – Program_4)

Title:
Simulation Pipeline for LLM-Based Policy Attitude Generation Using CGSS 2021 Demographic Data

Purpose:
This set of Python scripts replicates human survey responses by transforming demographic records from CGSS 2021 into narrative prompts, submitting them to a large language model (gpt-3.5-turbo), and recording simulated responses across 10 policy issues.  
The pipeline ensures consistency between human samples (CGSS 2021) and silicon samples (LLM outputs) through structured demographic mapping and controlled prompting.

--------------------------------------------
Overview of the Four Programs
--------------------------------------------
Program | Function | Input | Output | Relationship
---------------------------------------------------
Program_1 | Core API communication module | System prompt + user prompt | Model-generated response | Provides the LLM query function used by all other programs
Program_2 | Demographic narrative generator | Raw demographic dataset (data.csv) | Formatted demographic “backstories” | Must be executed first; provides backstory templates for later programs
Program_3 | Main simulation (overall sample) | Outputs from Program_2 + question list | Complete simulated dataset across 10 policy issues | Uses Program_1 and Program_2; iterates through all respondents
Program_4 | Stratified simulation (subgroup analysis) | Outputs from Program_2 + question list | Subgroup-level simulated datasets (e.g., gender, political status) | Extends Program_3; adds demographic stratification logic

--------------------------------------------
Logical Flow of Execution
--------------------------------------------
Step 1. Initialize API connection (Program_1)
- Defines function do_query()
- Establishes secure OpenAI API connection
- Sends structured prompts and retrieves model responses
- Provides core querying capability to all other programs

Step 2. Convert demographic data into narrative format (Program_2)
- Reads CGSS 2021 demographic data
- Converts structured variables (gender, education, income, etc.) into first-person sentences
- Generates templates and mappings for later use

Step 3. Generate simulated responses for overall sample (Program_3)
- Reads 10 selected CGSS policy questions
- Combines demographic backstories with survey items
- Submits prompts to LLM and stores responses in CSV files
- Tests whether LLM reproduces population-level response patterns

Step 4. Generate subgroup simulations (Program_4)
- Extends Program_3 to stratified analysis
- Defines strata by gender, religion, education, income, political status, etc.
- Generates subgroup-specific responses for comparison

--------------------------------------------
Logical Relationships Among Programs
--------------------------------------------
Program_1  →  Core LLM Query Engine
    ↓
Program_2  →  Generates Demographic Narratives
    ↓
Program_3  →  Overall Simulation
Program_4  →  Stratified Simulation (extends Program_3)

--------------------------------------------
Notes on Reproducibility and Extensibility
--------------------------------------------
- API Consistency: Uses same model parameters (gpt-3.5-turbo-0613, Temperature=1, Top P=1)
- Data Synchronization: All runs include “今天是2021年6月1日” to match CGSS 2021 timeline
- Retry Logic: Each query retries up to five times to handle API limits
- Scalability: Modular design allows replacement of do_query() with other APIs (Claude, Gemini, ChatGLM, etc.)